Guarding against Spurious Discoveries in High Dimensions
Abstract
Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically whether our discoveries are better than those found by chance? In this paper, we define a measure of goodness of spurious fit, which shows how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. The measure coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of this goodness of spurious fit for generalized linear models and L1-regression. The asymptotic distribution depends on the sample size, the ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by the multiplier bootstrap and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, by considering only candidate models whose goodness of fit is better than that of spurious fits. The theory and method are illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.
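To make the idea of a spurious-fit benchmark concrete, the following is a minimal sketch for the simplest case only: a linear model with a single selected covariate, where the goodness of spurious fit reduces to the largest absolute sample correlation between the response and any one covariate. It approximates the null distribution by plain Monte Carlo with simulated Gaussian noise responses, which is an assumption made here for illustration; it is neither the LAMM algorithm nor the multiplier bootstrap described in the abstract. The function names (`max_abs_correlation`, `spurious_null_draws`, `spurious_threshold`) and the simulated data are hypothetical.

```python
import numpy as np


def max_abs_correlation(X, y):
    """Largest absolute sample correlation between y and any single column of X.

    Assumes no column of X (and not y) is constant, so the norms are nonzero.
    """
    Xc = X - X.mean(axis=0)                 # center each covariate
    yc = y - y.mean()                       # center the response
    Xn = Xc / np.linalg.norm(Xc, axis=0)    # scale columns to unit length
    yn = yc / np.linalg.norm(yc)
    return float(np.max(np.abs(Xn.T @ yn)))


def spurious_null_draws(X, n_sim=500, seed=0):
    """Monte Carlo draws of the maximum correlation when y is pure noise.

    Assumption for illustration: the null response is i.i.d. standard normal
    and independent of the observed design matrix X.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    return np.array(
        [max_abs_correlation(X, rng.standard_normal(n)) for _ in range(n_sim)]
    )


def spurious_threshold(X, level=0.95, n_sim=500, seed=0):
    """Upper quantile of the simulated null maximum correlation."""
    return float(np.quantile(spurious_null_draws(X, n_sim, seed), level))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 50, 1000                         # few samples, many covariates
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)              # response unrelated to X
    observed = max_abs_correlation(X, y)
    benchmark = spurious_threshold(X, level=0.95)
    print(f"observed max |corr| = {observed:.3f}")
    print(f"95% spurious benchmark = {benchmark:.3f}")
    print("exceeds benchmark:", observed > benchmark)
```

Under these assumptions, a selected covariate would count as a discovery only if its observed correlation exceeds the simulated benchmark. With many covariates and few samples the benchmark can be far from zero, which is the point the abstract makes about selection over enormous numbers of candidates.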
Similar resources
Guarding from Spurious Discoveries in High Dimension
Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically whether our discoveries are better than those found by chance? In this paper, we define a measure of goodness ...
A total variation diminishing high resolution scheme for nonlinear conservation laws
In this paper we propose a novel high resolution scheme for scalar nonlinear hyperbolic conservation laws. The aim of high resolution schemes is to provide at least second order accuracy in smooth regions and produce sharp solutions near the discontinuities. We prove that the proposed scheme that is derived by utilizing an appropriate flux limiter is nonlinear stable in the sense of total varia...
Spurious Hyperleukocytosis
Hyperleukocytosis is an oncological emergency but is extremely rare in non-malignant conditions. Nucleated RBCs give rise to spuriously high total leucocyte count and cause clinical dilemma. Thalassemia major patients are known to have leucocytosis even after correction for nucleated RBCs. We report a case of undiagnosed Thalassemia major in a 4 month old infant with total leucocyte count highe...
Categories or dimensions: lessons learned from a taxometric analysis of Adult Attachment Interview data.
Booth-LaForce and Roisman's monograph on the Adult Attachment Interview (AAI) featured a taxometric analysis to determine whether variation along two components, dismissing and preoccupied states of mind, was categorical or dimensional. Empirically evaluating the latent structure of these constructs helps to avoid spurious categories or dimensions. This benefits researchers working with measure...
Mate-guarding courtship behaviour: tactics in a changing world
http://dx.doi.org/10.1016/j.anbehav.2014.08.007 Mate guarding is one of the most common tactics in sperm competition. Males are expected to guard their mates when costs of guarding (accrued from physical confrontations with rivals and/or reduced foraging) are low relative to the benefits of ensuring mating opportunities and paternity. We inves...
Journal title:
- Journal of machine learning research : JMLR
Volume 17
Publication date: 2016